-
-
- Minutes of the June 21 meeting of the
- Message Format Extensions working group.
-
-
- Attendees
- ---------
-
- Phill Gross pgross@nis.ans.net
- Peter Svanberg psu@nada.kth.se
- Byungnam Chung bnchung.sokri.etra.re.kr
- Bob Kummerfeld bob@ca.pn.oz.au
- Jonny Eriksson bygg@sunet.se
- Jan Michael Rynning jmr@nada.kth.se
- Keld Simonsen keld.simonsen@dkuug.dk
- Greg Vaudreuil gvaudre@nri.reston.va.us
-
- Agenda
- ------
-
- 1) Character Set Selection
-
- - Status and Input to the ISO 10646 process
- o Unicode <=> ISO 10646 Union?
- o Use of C0 and C1 codespace
-
- - Selection of "Common" character sets or schemes
- o ISO 8859-1, ISO 8859-n, Profiles for the use of ISO 2022?
- o Specifying "requiredness"
-
- - Specification of 8 bit character sets in headers
-
- Minutes
- -------
-
- 1) Character Set Issues
-
- a) Unified character set
-
- 1) Administrative
-
- At last word, ISO DIS 10646 had received 9 YES votes and 14 NO
- votes, and work is proceeding to resolve the remaining issues. An
- unofficial but promising effort is under way to unify ISO DIS
- 10646 and Unicode, another scheme for a global character set.
- This effort is being conducted outside the normal ISO process.
- The working group was asked to discuss the effort and endorse it
- if possible. After discussion, the group agreed that the efforts
- to combine Unicode and 10646 were in fact positive.
-
- 2) Technical
-
- The unification of ISO DIS 10646 and Unicode requires the
- resolution of several technical issues. The primary issue,
- tentatively resolved, involves "Han unification", a scheme that
- re-uses many of the graphics of the various Kanji character sets.
- Other issues involve the use of C0 and C1 codespace. Because the
- use of C0 and C1 codespace raises transport issues, this working
- group was asked for its input.
-
- C0 codespace consists of code positions 0 through 31 and 127,
- traditionally used for control characters. There is a proposal to
- use this space in the second octet of a multi-byte character for
- graphic characters. The working group discussed this and rejected
- the use of this space. A graphic character in the C0 space will
- likely be interpreted by a transport protocol as a control
- character. Many transport protocols that interpret in-band data,
- such as SMTP, may behave unpredictably in this situation. In one
- example, a sequence of graphics legally sent by an 8 bit sender
- is misinterpreted by a 7 bit receiver, after bit stripping, as
- the sequence 13-10-46-13-10 (CR LF "." CR LF), terminating the
- SMTP session prematurely. Other related anomalies were envisioned.
- Unless all transport protocols are made aware of the multi-byte
- nature of the data, an unlikely occurrence any time soon, reuse of
- C0 space is not recommended.
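-
- The bit-stripping hazard in the example above can be sketched in a
- few lines of Python. The octet values and the helper name
- strip_high_bit are illustrative assumptions, not taken from any
- implementation discussed at the meeting.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the high order bit of every octet, as a 7 bit receiver might."""
    return bytes(octet & 0x7F for octet in data)


# Hypothetical octets that an 8 bit sender could emit as graphic text.
sent = bytes([0x8D, 0x8A, 0xAE, 0x8D, 0x8A])
received = strip_high_bit(sent)

# After stripping, the receiver sees 13-10-46-13-10, i.e. CR LF "." CR LF.
looks_like_end_of_data = received == b"\r\n.\r\n"
```

- None of the sent octets is a control character, yet the stripped
- result matches the sequence a 7 bit SMTP receiver treats as the
- end of the message.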
-
- C1 codespace consists of code positions 128 through 159, space
- that may be interpreted as control characters if the high order
- bit is stripped. The ISO 8859-n character sets and the current
- 10646 proposal reserve this space for control characters only,
- with an eye toward backward compatibility with 7 bit systems. The
- working group discussed this and concluded that C1 codespace could
- be used for graphics if transport protocols could be relied upon
- never to strip the high order bit and interpret the resulting
- characters as control sequences. The working group did not make a
- specific recommendation, noting only that the use of C1 space to
- compact a character set was a positive thing and that future
- evolution of transport protocols should support the use of this
- space for graphics.
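-
- As a rough check of why C1 is sensitive to bit stripping, the
- following Python sketch folds each octet to 7 bits and looks at
- where it lands. The C0 range follows the definition given in these
- minutes, and C1 is taken as 128-159; the names are illustrative
- assumptions.

```python
C0 = set(range(0, 32)) | {127}  # control positions as defined in these minutes
C1 = set(range(128, 160))       # positions 128-159


def fold_to_7bit(octet: int) -> int:
    """Return the value left after the high order bit is stripped."""
    return octet & 0x7F


# Every C1 position lands on a control position once the high bit is dropped.
c1_folds_to_control = all(fold_to_7bit(octet) in C0 for octet in C1)

# Positions 160-255 fold onto 32-127 instead, mostly ordinary graphics.
upper_half_folds_to = {fold_to_7bit(octet) for octet in range(160, 256)}
```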
-
-
- b) Common Character Sets
-
- In the absence of a single international standard character set,
- the working group needs to profile the use of a limited number of
- the 200+ character sets in use worldwide to facilitate
- interoperation. Keld S. gave an overview of the character sets
- currently in use.
-
- ISO 7 bit family:
-   ASCII
-   National Versions
-     10 National use
-     2 Alternate rep # $
-   ECMA registry
-     7, 8, 16 bit
-     ISO 2022 shifts
-
- ISO 8 bit 8859 family:
-   1 char = 1 octet
-   ASCII in pos 0-127
-   Pos 160-255
-     Latin sets (5)
-     Cyrillic
-     Greek
-     Arabic
-     Hebrew
-
- ISO 6937-2 family 8/16 bit:
-   6937-2, T.61
-   Non-Spacing accents
-   1 char = 1 or 2 bytes
-   about 330 graphical chars
-
- Vendor 8 bit sets:
-   DEC-MCS
-   HP Roman8
-   IBM PC codepages (5)
-     Uses also 128-159 (C1)
-   IBM EBCDIC
-     Many versions
-     Not ASCII Compatible
-
- 16 bit char sets:
-   Japanese: JIS 0208, 0212
-   Chinese: GB 1980
-   Korean:
-   Japanese 8/16 bit: Shift JIS
-   Unicode: New vendor charset unifies CN, JP, KO sets
-     Incompatible with ISO
-
- Multi-byte:
-   EUC: Extended UNIX code
-   ISO 2022 shifting
-     SS1 SS2 SS3
-     4 char sets
-     8/16/24 bits
-
- 32 Bits:
-   ISO 10646
-     Also usable in 8, 16, or 24 bit compaction methods
-     Proper encoding subsets: ASCII and ISO 8859-1
-
- Control Character Sets:
-   ISO 646: 0-31, 127
-   ISO 6429: 0-31, 127-159
-   EBCDIC: as ISO 646
-
- Several ideas were batted around, including strict use of ISO
- 2022, profiling language to character set mapping, and the use of
- "preferred" character sets. The working group felt that the best
- approach was to codify existing practice in the interim, pending
- adoption of an "international" character set. This existing
- practice was reduced to the following.
-
- If possible, use ISO 8859, with the lowest version number possible,
- i.e., use 8859-1 (Latin 1) over 8859-10? (Latin 5?). If the
- characters needed are not in the 8859 sets (e.g., Kanji), use the
- 2022 character switching standard, declaring 2022 in the header of
- the document. While this may lead to the use of any of the many
- character sets in the ECMA registry, the WG felt that in practice
- only the current Oriental mail systems will use the 2022 system,
- and only with limited character sets.
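-
- The selection rule above might be sketched as follows. This is an
- illustration only: it borrows Python's codec names as labels,
- assumes parts 1 through 10 of ISO 8859 are the candidates, and
- returns the placeholder string "iso-2022" where the rule falls
- back to 2022 switching. The function name choose_charset is
- hypothetical.

```python
# Candidate ISO 8859 parts, tried in ascending order (Python codec names).
ISO_8859_PARTS = [f"iso8859_{n}" for n in range(1, 11)]


def choose_charset(text: str) -> str:
    """Return the lowest-numbered ISO 8859 part that can represent text."""
    for codec in ISO_8859_PARTS:
        try:
            text.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    # Placeholder label: no 8859 part fits, so fall back to ISO 2022.
    return "iso-2022"
```

- Under these assumptions, Western European text selects 8859-1,
- Cyrillic text selects 8859-5, and Kanji text falls through to the
- 2022 case.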
-
- c) Use of Non-ASCII character sets in headers.
-
-
- What a mess! The attendees of this meeting spent over an hour
- working on various schemes for indicating character sets other
- than ASCII in the headers of a message. It was identified as a
- requirement that the fields defined as TEXT be able to have
- variable character sets. While this goal was stated, no mechanism
- for its implementation was agreed upon.
-
- A modification of the BNF notation was suggested by Keld S.
-
- CHAR-EIGHT = <any eight-bit character> ; (0-377, 0.-255.)
-
- qtext = <any CHAR-EIGHT excepting <">, "\" & CR, and
- including linear-white-space>
-
- quoted-pair = "\" CHAR-EIGHT
-
- text = <any CHAR-EIGHT, including bare CR & bare LF but
- NOT including CRLF>
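-
- Read literally, the modified rules above amount to a few octet
- tests. The Python sketch below is one possible reading, not an
- agreed implementation; the function names are hypothetical, and
- the qtext test works on a single octet, so it does not model
- folded linear-white-space.

```python
CR = 0x0D
DQUOTE = 0x22
BACKSLASH = 0x5C


def is_char_eight(octet: int) -> bool:
    """CHAR-EIGHT: any eight-bit character, decimal 0 through 255."""
    return 0 <= octet <= 255


def is_qtext_octet(octet: int) -> bool:
    """qtext: any CHAR-EIGHT except <">, "\\" and CR."""
    return is_char_eight(octet) and octet not in (DQUOTE, BACKSLASH, CR)


def is_text(data: bytes) -> bool:
    """text: bare CR and bare LF are allowed, the CRLF pair is not."""
    return b"\r\n" not in data
```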
-
-
- This notation was accepted by the attendees of the meeting;
- however, several problems were identified and not resolved: 1)
- identification of the header character set and the need for
- conversion, and 2) encoding of the header character sets in a 7
- bit transport format.
-
- It was not clear how a conversion gateway would know that the
- header was 8 bit and needed encoding. A suggestion accepted by the
- group was that use of the new BNF require a header-charset header
- line. This additional header adds complexity to user agents and
- conversion gateways by requiring two passes over the header: one
- to determine its character set and one to convert it into a
- passable or readable form. It was felt that this was inelegant but
- do-able.
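-
- The two-pass processing described above could look roughly like
- the Python sketch below. The header-charset field name comes from
- the discussion; its exact syntax, the ASCII default, and the
- function name decode_header_lines are assumptions made for
- illustration.

```python
def decode_header_lines(lines: list[bytes]) -> list[str]:
    """Decode raw header lines using a declared header character set."""
    # Pass 1: scan the whole header for the header-charset line, if any.
    charset = "ascii"
    for line in lines:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"header-charset":
            charset = value.strip().decode("ascii")

    # Pass 2: convert every line now that the character set is known.
    return [line.decode(charset) for line in lines]


raw = [b"Subject: caf\xe9", b"Header-Charset: iso-8859-1"]
decoded = decode_header_lines(raw)
```

- The first pass is needed because the declaring line may appear
- after the fields it describes, as it does in this example.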
-
- Several proposals were discussed for encoding the 8 bit text
- strings when 7 bit transport was required. It was accepted that
- this was a hard requirement.
-
- 1) Variable Substitution
-
- One proposal for the insertion of 8 bit text was to substitute
- a variable name in the header for each text string needing 8 bit
- characters. The variable could then be defined elsewhere in the
- header, including the encoded actual string and a token indicating
- the character set. This was rejected as messy and difficult to
- implement in current user agents.
-
- 2) Message Encapsulation
-
- Encapsulate the mail message using the message type body part and
- a suitable transport encoding, preferably quoted-printable. At
- least one implementor of the message format standard objects to
- this proposal as imposing excessive complexity on the user agent.
- It is also not clear that the encapsulated message will be
- permitted to have a transport encoding.
-
- 3) Encoded Text Fields
-
- This proposal would specify a standard encoding for the header
- fields, possibly quoted-readable or quoted-printable, and identify
- this fact in a header-transport-encoding header or the
- header-character-set header.
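-
- A minimal sketch of the encoded text field idea, using Python's
- standard quopri module for the quoted-printable step. The two
- extra field names are taken from the proposal, but their syntax,
- the choice to emit both of them, and the function name
- encode_text_field are assumptions.

```python
import quopri


def encode_text_field(name: str, value: str, charset: str) -> list[str]:
    """Return 7 bit header lines carrying value in quoted-printable form."""
    raw = value.encode(charset)
    # header=True makes quopri write spaces as underscores.
    encoded = quopri.encodestring(raw, header=True).decode("ascii")
    return [
        f"{name}: {encoded}",
        f"Header-Character-Set: {charset}",             # hypothetical syntax
        "Header-Transport-Encoding: quoted-printable",  # hypothetical syntax
    ]


lines = encode_text_field("Subject", "café au lait", "iso-8859-1")
```

- Every resulting line is plain 7 bit text, so a gateway can pass it
- along without parsing or converting the field.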
-
- Conclusions
-
- While no one was happy, the group tentatively agreed not to permit
- 8 bit text in the headers. The only reasonable way to carry such
- text over 7 bit transport was to encode the text fields and insert
- a new header line. Given this overhead, the group agreed that,
- while not ideal, extended character sets should be required to
- always be encoded, eliminating the need for intermediate gateways
- to parse and convert the headers.
-
-